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We assume that we have observational data generated from an 
unknown underlying directed acyclic graph (DAG) model. A DAG is 
typically not identifiable from observational data, but it is possible 
to consistently estimate the equivalence class of a DAG. Moreover, 
for any given DAG, causal effects can be estimated using intervention 
calculus. In this paper, we combine these two parts. For each DAG in 
the estimated equivalence class, we use intervention calculus to esti- 
mate the causal effects of the covariates on the response. This yields 
a collection of estimated causal effects for each covariate. We show 
that the distinct values in this set can be consistently estimated by 
an algorithm that uses only local information of the graph. This lo- 
cal approach is computationally fast and feasible in high-dimensional 
problems. We propose to use summary measures of the set of pos- 
sible causal effects to determine variable importance. In particular, 
we use the minimum absolute value of this set, since that is a lower 
bound on the size of the causal effect. We demonstrate the merits of 
our methods in a simulation study and on a data set about riboflavin 
production. 

1. Introduction. Our work is motivated by the following problem in bi- 
ology. We want to know which genes play a role in a certain phenotype, 
say a disease status or, in our case, a continuous value of riboflavin (vi- 
tamin B2) production in the bacterium Bacillus subtilis. To be more pre- 
cise, our goal is to infer which genes have an effect on the phenotype in 
terms of an intervention. If we knocked down single genes, which of them 
would show a relevant or important effect on the phenotype? The difficulty 
is, however, that the available data are only observational. For our con- 
crete problem, we observe the logarithm of the riboflavin production rate 
as a continuous response and expression measurements from essentially the 
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whole genome of B. subtilis as high-dimensional covariates. Using such ob- 
servational data, we want to infer all (single gene) intervention effects. This 
task coincides with inferring causal effects, a well-established area in Statis- 
tics (e.g., [5, 8, 10, 11, 13, 18, 24, 25, 26] and [31]). We emphasize that, in 
our application, it is exactly the intervention or causal effect that is of in- 
terest, rather than a regression- type effect of association. If we can estimate 
the intervention effects from observational data, we can score each gene ac- 
cording to its potential to have an intervention (knock-down) effect on the 
riboflavin production rate, and the most promising candidate genes can be 
tested afterward in biological experiments. 

Pearl ([25], page 285) formulates the distinction between associational and 
causal concepts as follows: "an associational concept is any relationship that 
can be defined in terms of a joint distribution of observed variables, and a 
causal concept is any relationship that cannot be defined from the distribu- 
tion alone. . . . Every claim invoking causal concepts must be traced to some 
premises that invoke such concepts; it cannot be inferred or derived from 
statistical associations alone." Thus, in order to obtain causal statements 
from observational data, one needs to make additional assumptions. One 
possibility is to assume that the data were generated by a directed acyclic 
graph (DAG) which is known beforehand. DAGs describe causal concepts, 
since they code potential causal relationships between variables: the exis- 
tence of a directed edge x — > y means that x may have a direct causal effect 
on y, and the absence of a directed edge x — > y means that x cannot have 
a direct causal effect on y (see Remark 2.3 for a definition of direct causal 
effect). 

Given a set of conditional dependencies from observational data and a 
corresponding DAG model, one can compute causal effects using interven- 
tion calculus (e.g., [24] and [25]). In this paper, we consider the problem of 
inferring causal information from observational data, under the assumption 
that the data were generated by an unknown DAG. This is a more realistic 
assumption, since, in many practical problems, one does not know the DAG. 
In this scenario, the causal effect is typically not defined uniquely, and that 
is not surprising, given the description of causality by Pearl [25] above. 

A DAG is typically not identifiable from observational data, because 
conditional dependencies only determine the skeleton and the so-called v- 
structures of the graph. The skeleton and ^-structures determine an equiva- 
lence class of DAGs that all correspond to the same probability distribution. 
This equivalence class can be described by a completed partially directed 
acyclic graph (CPDAG) (see Section 2.1). 

The existence of the equivalence class opens the way to the following 
strategy. Suppose that we are interested in the causal effects of a covariate 
X{ on a response Y. We are given the joint distribution of Xi, . . . ,X p ,Y, 
and use this to find the equivalence class of DAGs that correspond to this 
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distribution. Assume that this equivalence class contains m different DAGs. 
For each DAG Gj in this class, we can apply intervention calculus to obtain 
the causal effect 6ij of Xi on Y . We can summarize this information for 
i = 1, . . . ,p and j = 1, ... ,m in a, pxm matrix B, where each row corresponds 
to a covariate and each column corresponds to a DAG in the equivalence 
class. Since the ordering of the DAGs in the equivalence class is arbitrary, 
the columns of this matrix can be permuted in any order. It is our goal 
to estimate this matrix G. A slightly less ambitious goal is to estimate 
the multisets Gj = {9ij}je{i,...,m}} * = 1? ■ • ■ >Pi containing the possible causal 
effects of covariate Xi on Y (see Section 3.2 for the definition of a multiset). 
Note that Q contains slightly more information than Qj, i = 1, . . . ,p, since 
the columns of G tell us which possible causal effects originate from the 
same DAG, while this information is lost in the multisets Gj, i = 1, . . . ,p. 

In special cases, all values 9ij,j = 1, . . . , m, in Gj may be identical, so that 
the causal effect of Aj on Y is uniquely determined. But, even if Gj contains 
distinct values, it still contains useful causal information. For example, if 
9ij for all j = 1, ... , m, then Xj must have a causal effect on Y (positive 
or negative). Similarly, if 9ij > for all j = 1, . . . , m, then Xi must have a 
positive causal effect on Y. Finally, the minimum absolute value min,- \6ij\ 
is a lower bound on the size of the causal effect of Xi on Y. We use this 
bound to determine variable importance. 

There is a large existing literature on estimating the equivalence class 
of DAGs (e.g., [2, 3, 4, 12, 14, 30, 31] and [33]), and there is also a large 
literature on estimating causal effects when a DAG is given (e.g., [18, 19, 
23, 24] and [25]). Our new approach combines these two parts in order to 
estimate the multisets of possible causal effects Qj, i = 1, . . . ,p. We use these 
multisets to determine bounds for causal effects and causal importance of 
variables. We also show that the distinct values of Gj can be estimated by 
a new algorithm that uses only local information of the estimated CPDAG, 
thus allowing for efficient computation in very large problems, and we prove 
that this method is asymptotically consistent in sparse high-dimensional 
settings. 

The outline of this paper is as follows. In Section 2, we introduce termi- 
nology for graphs and intervention calculus. Sections 3 and 4 discuss our 
proposed methodology to estimate the multisets of possible causal effects 
Qj, i = 1, . . . ,p. Section 3 discusses so-called population versions of the algo- 
rithms that can be used if all conditional dependencies are known exactly. 
Section 4 discusses sample versions of the algorithms that can be used if the 
conditional dependencies are estimated from data. In Section 5, we prove 
asymptotic consistency of our methods in high-dimensional settings with 
certain sparsity and regularity assumptions. In Section 6, we evaluate our 
methods in a simulation study, and apply them to the riboflavin data set. 
Finally, Section 7 contains a brief discussion, Section 8 contains collected 
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proofs and the Appendix contains a description of possible modifications of 
the algorithms. 

2. Graph terminology and intervention calculus. 

2.1. Graphs. Let G = (V, E) be a graph consisting of vertices V and a set 
of edges E C V x V . In our context, the vertices represent random variables 
X±,...,X P and Y, and the edges represent relationships between pairs of 
these variables. 

An edge between two vertices, say Xj and Xj, is directed if the edge has 
an arrowhead, that is, Xj <— Xj or Xi — > Xj . An edge between Xi and Xj 
is undirected if it has no arrowhead, that is, Xi — Xj. A directed graph is 
a graph in which all edges are directed. An undirected graph is a graph in 
which all edges are undirected. A partially directed graph may contain both 
directed and undirected edges. The skeleton of a (partially) directed graph G 
is the undirected graph that is obtained from G by removing all arrowheads. 

Two vertices Xi and Xj are adjacent if there is a directed or undirected 
edge between them. The adjacency set of a vertex Xi, denoted by adjj(G), 
is the collection of all vertices that are adjacent to Xi in G. A path is any 
unbroken nonintersecting route that can be traced along the edges of the 
skeleton of the graph. A directed path is a path along directed edges that 
follows the direction of the arrows. A [directed) cycle is a (directed) path 
that starts and ends at the same vertex. A graph that contains no directed 
cycles is called acyclic. A graph that is both directed and acyclic is called a 
directed acyclic graph (DAG). A v-structure in a graph G is an ordered triple 
of vertices, say (Xi, Xj, X&), such that G contains directed edges Xi — ► Xj 
and Xj <— Xk, and Xi and X^ are not adjacent in G. The vertex Xj is then 
called a collider in this v-structure. 

Consider a partially directed graph G. Vertex Xj is said to be a parent 
of Xi in G if there is a directed edge Xj — > Xi. The set of all parents of 
Xi in G is denoted by pa,;(G). Vertex Xj is said to be a sibling of in 
G if there is an undirected edge Xi — Xj. The set of all siblings of Xi in 
G is denoted by sibj(G). For any subset S of sibj(G), we let Gs^i denote 
the graph that is obtained by changing all undirected edges Xj — Xi with 
Xj G S into directed edges Xi <— Xj , and all undirected edges Xj — Xi with 
Xj G sibj (G) \ S into directed edges Xi — > Xj . If the graph G is clear from 
the context, we write pa^ and sibj instead of pa^(G) and sibj(G). 

A DAG encodes conditional independence relationships via the notion of 
d-separation ([24], Definition 1.2.3, page 16). A distribution P is said to 
be faithful to a graph G if the conditional independence relationships of P 
are exactly the same as those encoded by G via d-separation. In general, 
the same set of conditional independence relationships can be described by 
several DAGs. These DAGs form an equivalence class, consisting of DAGs 
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with the same skeleton and the same u-structures [33]. Such an equivalence 
class can be uniquely described by a completed partially directed acyclic 
graph (CPDAG) [2]. This is a partially directed graph with the same skeleton 
as the graphs in the equivalence class in which the edges are directed as 
follows: (i) the directed edges represent arrows that are common to all DAGs 
in the equivalence class, and (ii) the undirected edges correspond to edges 
that are directed one way in some DAGs and the other way in other DAGs in 
the equivalence class. We say that a partially directed graph G is extendable 
to a DAG if its undirected edges can be directed without creating directed 
cycles or additional v -structures. 

A CPDAG can be estimated in various ways, including the PC-algorithm 
[31], search and score methods (cf. [2, 3, 4] and [33]) and Bayesian methods 
(cf. [12] and [30]). In this paper, we will use the PC-algorithm, since this 
algorithm is computationally feasible and asymptotically consistent in sparse 
high-dimensional settings [14] . We refer to [28] and [35] for a discussion about 
pointwise versus uniform consistency of the PC-algorithm. 

2.2. Intervention calculus. We now give a brief Introduction to inter- 
vention calculus, mostly based on [24] and [25]. We consider p + 1 variables 
X\,. . . , X p , Y (sometimes also referred to as X\, . . . , X p+ i). 

Any distribution that is generated from a DAG with independent error 
terms is called Markovian, with respect to the DAG. Any Markovian distri- 
bution can be factorized as 

p+1 

f(x 1 ,...,x p+1 ) = JJ f{x j \p& j ) 

3=1 

see [25], Theorem 3.1, page 297; see also [17], Section 3.2.2, for a formulation 
in terms of directed local or global Markov properties. 

In order to represent the effect of an intervention on a set of variables, [16] 
and [23] introduced so-called do or set operators. In particular, they used 
expressions of the form /(y|do(Aj = x' i )) or f(y\set{Xi =x' i )) to denote the 
distribution of Y that would occur if treatment condition Xi = x\ was en- 
forced uniformly over the population via some intervention. For a Markovian 
model, the distribution generated by an intervention do(Aj = x-) on the set 
of variables X\ , . . . , X p +\ is given by the following truncated factorization 
formula: 

{p+1 
0, otherwise, 
where /(xj |pa„) are the pre- intervention conditional distributions ([25], Corol- 
lary 3.1, page 297). Note that this formula uses the DAG structure (deter- 
mining the sets pa^ ) to write the interventional distribution on the left-hand 
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side in terms of pre-intervention conditional distributions on the right-hand 
side. 

The distribution of Y = X p+ \ after an intervention do(Xi = x'f) can be 
found by integrating out x±, . . . ,x p in (1). It can be shown that this simplifies 
to the following: 

j f(y), if^Gpa i5 

(2) /(y| do (^ = ^)) = || / (y|4p a .) / ( pa .) cZ p a . ) ifY^pa,, 

where /(■) and /(-|x^,paj) represent pre-intervention distributions ([24], 
Theorem 3.2.2, page 73). Note that the expression in (2) for Y ^ pa^ is 
a special case of so-called back-door adjustment ([24], Theorem 3.3.2, page 
79) since pa^ satisfies the back-door criterion relative to (Xi,Y) if Y £ pa,; 
([24], Definition 3.3.1, page 79). 

It is common ([24], page 70) to summarize the distribution generated by 
an intervention by its mean 

(E(Y), if^epa i; 
E(Y\do(X l = x>)) = ^j E( Y\4, pai )f( pai ) dpai , ify^pa,, 

and we can then define the causal effect of do(Xj = x'f) on Y by 

(3) ^-E{Y\do{X i = x))\ x=x , i . 

In the remainder of the paper, we consider the case that Xi, . . . ,X P ,Y 
are jointly Gaussian, and we are interested in the causal effect of X^ on Y 
for i = 1, . . . ,p. In this case, it is very simple to compute the causal effects 
as defined in (3), since Gaussianity implies that E(Y\x' i ,pa i ) is linear in x\ 
and pa^ 

EpTl^p&i) = 70 + lix'i + T^pa, 

for some values 70, 7i G K and 7 pa . G M' pai ', where |paj is the cardinality of 
the set pa^ Hence, 

J E(Y\x' i ,p& i )f(pa i )dp& i = j i x' i + J 7jLjPa</(pa*) <*P&i 

is linear in x[. Combining this with (3), it follows that the causal effect 
of Xi on Y with Y ^ pa^ is given by 7$, which is simply the regression 
coefficient of Xi in the regression of Y on X{ and pa;. In general, the causal 
effect of Xi on Y as defined in (3) is given by /3ji pa ., where, for any set 
SQ{X 1 ,...,X p ,Y}\{X i }, 

0, if Y G 5, 

coefficient of X t in Y ~ X t + S, if Y $ S, 
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(a) Example 2.1 (b) Example 2.2 

Fig. 1. Graphical representation of the models used in Examples 2.1 and 2.2. 

and Y ~ X, + S is shorthand for the linear regression of Y on Xi and S. 
Hence, in the Gaussian case, the causal effect does not depend on the value 
of x'i, and can be interpreted as 

E(Y\do(X l = x\ + 1)) - £(F|do(X = xj)) 

for any value of x\. 

2.3. Intervention calculus versus association. In the previous section, 
we discussed that, for jointly Gaussian variables, intervention effects can 
be computed using linear regression. We emphasize, however, that interven- 
tion calculus and multiple regression analysis generally give different results, 
since the set of variables that is controlled for is different. We illustrate this 
difference using two examples. In Example 2.1, the variable that appears 
to be most important in the regression analysis is least important in the 
causal analysis. Example 2.2 shows that the opposite is also possible. The 
variable that has no importance in the regression analysis is most important 
in the causal analysis. Throughout, we will use (3 to denote the regression 
parameters, and 6 to denote the intervention effects. 

Example 2.1. Consider the following model [see Figure 1(a)]: X 2 = £ 2 , 
Xi = 0.8X 2 +£1, X 3 = O.8X2 + e 3 and 

Y = —X\ + 2X 2 - X 3 + e, 

where £i,£2,£3 and e are mutually independent normal random variables 
with mean zero and variances a\ = 0.36, o~ 2 = 1, o\ = 0.36 and a 2 = 1. Note 
that Xi, X2 and X3 all have variance 1, so that we can meaningfully compare 
their regression coefficients or causal effects. 

First suppose that we apply multiple linear regression Y = a + /3iXi + 
(3 2 X 2 + P3X3 + e. Then the regression coefficients are fi\ = — 1, (3 2 = 2 and 
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/?3 = —1. Looking at the sizes of the effects, variable X 2 is most important 
in the regression analysis. 

Next, we apply intervention calculus. Let 9 = (9i,9 2 ,9 3 ), where 9i repre- 
sents the causal effect of Xi on Y. Since pa x = {X 2 }, pa 2 = and pa 3 = 
{X 2 }, we have 9i = 0i\x 2 = -1, #2 = 02\z = 0-4 and 6 3 = 0^x 2 = -1. We 
see that 9\ = 0\ and 9% = 0z , but that 9 2 7^ 02 ■ Considering the sizes of the 
causal effects, variable X 2 is least important in the causal analysis. 

Example 2.2. Let X±, X 2 and X3 be as in Example 2.1, and let 

Y = X 1 +X 3 + e 

[Figure 1(b)]. Applying multiple linear regression Y = a + 0\X\ + 2 X 2 + 
3 X 3 + e, the regression coefficients are 0\ = 1, 2 = and = 1. Looking 
at the sizes of the effects, variable X 2 is least important. 

On the other hand, if we consider intervention calculus, we get 9\ = 
0i\x 2 = I? ^2 = 02\0 = 1-6 and ^3 = /3 3 i x 2 = 1- We again see that 9\ = 0\ 
and 03 = /?3, but that 9 2 ^ 2 . Considering the sizes of the causal effects, 
variable X 2 is now most important. 

Remark 2.3. In Examples 2.1 and 2.2, Y is not a parent of any of the 
X's. For such DAGs, we can formulate the distinction between intervention 
calculus and multiple regression as follows. The causal effect 9i measures 
the total effect of variable Xi on the response Y (i.e., the sensitivity of Y 
to interventional changes in Xi). On the other hand, the regression param- 
eter 0i measures the direct effect of Xi on Y (i.e., the sensitivity of Y to 
interventional changes in Xi when all other variables in the model are held 
fixed). For a precise definition of direct effect (see [24], pages 126-127). 

3. Population versions of the algorithms. The intervention calculus dis- 
cussed in Section 2.2 assumes that the DAG that generates the distribution 
of Xi, . . . , X p , Y is known. We now present our new methodology for deter- 
mining causal effects when the DAG is unknown. First, in Section 3.1, we 
state our assumptions. In Section 3.2, we discuss our methods, assuming 
that all conditional dependencies are known exactly (hence the terminology 
population versions). Section 4 will treat sample versions of the algorithms, 
that is, versions of the algorithms that can be used if the conditional depen- 
dencies are estimated from the data. We split the exposition in these two 
parts, since this allows us to separate the main ideas of the methods (Sec- 
tion 3) from the extra complications that arise from working with estimated 
conditional dependencies (Section 4). 
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CPDAG G Gi G 2 G 3 G 4 

X2 — -X3 X2 *■ X3 X2 *• X3 X2 ■* X3 A2 *■ A3 

/\/\/\/\/\ 

Xi - X4, y Xi * x 4 y Xi *• x 4 ■*■ y x x *x 4 + y x x ^x a ^y 

Fig. 2. A CPDAG G with the DAGs Gi, . . . , G4 that are in its equivalence class. 



3.1. Assumptions. We make the following assumptions: 

(A) The distribution of (X%, . . . , X p , Y) is multivariate normal. Moreover, it 
is Markovian and faithful to the true (unknown) causal DAG. 

(B) Xi, . . . , X p have equal variance. 

The Gaussianity assumption in (A) implies that E(Y\S) is linear for any 
5 C {X±, . . . , X p }, so that the causal effects can be easily computed (see 
Section 2.2). Moreover, it allows us to equate conditional independence with 
zero partial correlation. This is useful in the PC-algorithm [31], which we 
employ to find the equivalence class of DAGs. Faithfulness is also used in 
the PC-algorithm. It makes it possible to move hierarchically from marginal 
or low-order partial correlations to higher orders, yielding a tremendous 
computational advantage if p is large. Both normality and faithfulness are 
used to prove consistency of our methods (see Section 5). Assumption (B) 
is made for convenience, so that we can easily compare the causal effects of 
different variables. 

3.2. The algorithms. In the population versions of the algorithms, we 
assume that all conditional dependencies are known exactly. In this case, 
the population version of the PC-algorithm (see [14] and [31] for a detailed 
description) yields the correct CPDAG. 

Based on this CPDAG, we can compute the sets of possible causal effects. 
Before describing the algorithms to do this, we note that the output of the 
algorithms consists of multisets. A multiset is similar to a set, with the only 
difference that in a multiset the multiplicity of elements matters. Thus, the 
multisets {a,b} and {b, a} are equal, just as the sets {a, b} and {6, a}, since 
the order of the elements does not matter. But the multisets {a, a} and {a} 
are not equal, while the sets {a, a} and {a} are. 

The basic idea of our method is given in pseudocode in Algorithm 1. We 
illustrate this algorithm by computing 0i, the set of possible causal effects 
of Xi on Y, for the CPDAG G in Figure 2. First, we list all DAGs in the 
equivalence class of G. Note that G contains the 3 undirected edges X\ — X2, 
Xi — Xi and X2 — A3. There are 8 possible ways to direct these edges, but 
some of these lead to graphs that are not in the equivalence class of G. For 
example, the configuration X\ — > X%, X\ — * A4 and X2 <— A3 is invalid, since 
this creates a new u-structure X\ — > X2 <— A3 and that is incompatible with 
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the equivalence class represented by G (see Section 2.1). Excluding such 
invalid configurations leaves four DAGs in the equivalence class of G (see 
Gi, . . . , £?4 in Figure 2). Next, for each j = 1, . . . , 4, we compute the causal 
effect Q\j of X\ on Y, assuming the data were generated from DAG Gj. 
Using (4) and assumption (A) of Section 3.1, this yields 

©1 = {011,012,013)014} = {/3l| P a 1 (Gi)> A|pa 1 (G 2 ))^l|pa 1 (G 3 )) /3l| pai (G 4 )} 

(5) 

= {Pl\0,Pl\X2) Pl\X 2 i Pi\x A }- 

Note that the parental sets of X\ in the four DAGs in the equivalence class 
of G play a crucial role in determining the possible causal effects of X± 
on Y. In particular, since pa 1 (Gi) = 0, pa^G^) = pa^G^) = {A2}, and 
pa^G^) = {A4}, the multiset 61 contains Pi\ with multiplicity 1, f3i\x 2 
with multiplicity 2, and P±\x 4 with multiplicity 1. 

The basic Algorithm 1 works well if the number of covariates is small, say 
less than 10 or so. But, if the number of covariates increases, it quickly be- 
comes infeasible to compute all DAGs in the equivalence class. We therefore 
developed a localized algorithm which is much faster. In order to explain this 
local algorithm, we first discuss a variation on the basic algorithm, given in 
pseudocode in Algorithm 2. 

Algorithm 2 is based on the idea that, for the computation of 0i, the 
parents of X± in the different DAGs in the equivalence class are of key 
importance. Therefore, we first consider the CPDAG G and determine all 
possible parental sets of X±; that is, we take all sets pa x (G) U S where 
5 C sibi(G). In Figure 2, pa x (G) = and sibi(G) = {X 2 , A 4 }, so that the 
possible parental sets of X\ are 0, {A2}, {A4} and {A2, A4}. These sets S 
determine the direction of the edges between X\ and the vertices in sibi(G). 
All edges between X\ and vertices in S must be directed toward X\, and all 
edges between X\ and vertices in sibi (G) \ S must be directed away from 
X\, exactly as in Gs—>i (see Section 2.1). For each set S, we then determine 
the number of DAGs mj to which Gs^i is extendable. As illustration, we 
compute ms for 5 = {X2} and S = {^"4}. First, note that S = {X2} implies 



Algorithm 1: Basic algorithm 
Input: CPDAG G, conditional dependencies of Xi, . . . ,X P ,Y 
Output: Matrix O of possible causal effects 

1 Determine all DAGs Gi, . . . , G m in the equivalence class of G 

2 for j = 1 to m do 

3 for i = 1 to p do 

4 Oij =ft| P a i (G j ) [see (4)] 

5 end 

6 end 
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Algorithm 2: Variation on Algorithm 1 (for instructive purposes) 
Input: CPDAG G, conditional dependencies of Xi, . . . ,X P ,Y 
Output: Multisets Oi, . . . , O p of possible causal effects 

1 for i=l to p do 

2 9i = 

3 for each subset S of sibi(G) do 

4 ms = number of DAGs to which Gs^i is extendable 

5 add m s copies of A| pai (G)us to 6, 

6 end 

7 end 



that X\ <— X2 and X% — > A4, since X2 is a parent of Ai and A4 is not. The 
undirected edge X2 — A3 in Gs^i can then be directed both ways without 
creating a new u-structure or a directed cycle. Hence, for S = {A 2 } we have 
ms = 2. On the other hand, S = {A4} implies Ai — ► A2 and X\ <— A4. In 
this case, the undirected edge A2 — A3 in Gs^i must be directed toward 
A3, since otherwise a new v -structure Ai — ► A2 <— A3 is created. Hence, 
for S = {A4} we have ms = 1. Using the same reasoning for S = and 
S' = {A2,A4}, one can easily check that the multiplicities corresponding 
to S = 0, {A 2 }, {A 4 }, {A 2 ,A 4 } are mj = 1,2,1,0. Finally, we form the 
multiset 01 by taking the elements /3i| pai (G)u5 with multiplicities ms, for 
all S C sibi(G) (where elements with multiplicity zero are omitted). Thus, 

in Figure 2, we obtain 0i = {Pi\ , Pi\x 2 , Pi\X 2 > A|xJ- 

From this construction, it is clear that Algorithm 2 gives the same output 
as Algorithm 1 (with the only difference that Algorithm 2 does not yield 
the column structure of G, telling us which causal effects originate from the 
same DAG). Note that Algorithm 2 is not faster than Algorithm 1. The new 
bottleneck is the computation of the multiplicities ms, which again quickly 
becomes infeasible if the number of covariates increases. We therefore do 
not recommend using this algorithm in practice. However, we can slightly 
modify Algorithm 2 to obtain a fast localized algorithm, given in pseudocode 
in Algorithm 3. 

The difference between Algorithms 2 and 3 is that Algorithm 3 replaces 
the computation of ms by a much simpler step which only checks if Gs—>i is 
locally valid, meaning that Gs^i does not contain an additional ^-structure 
with Aj as collider. In the example in Figure 2, Gs->i is locally valid for 
S = 0, {A2} and {A4}, and it is not locally valid for S = {A2, A4}. We then 
form a new multiset by taking all elements /9i| pai (G)uS f° r which Gs->i 
is locally valid. In the example, this results in 0f = Pi\x 2 ^i\x 4 }- 

Note that, for the CPDAG in Figure 2, the sets of distinct values in Of 
and ©i are the same, but the multiplicities are different. It turns out that 
this holds in general. To show this, we need the following lemma. 
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Algorithm 3: Local algorithm 
Input: CPDAG G, conditional dependencies of Xi, . . . ,X P ,Y 
Output: Multisets ®f , i = l,...,p 

1 for i = 1 to p do 

2 ef = 

3 for each subset S of sibi(G) do 

4 if Gs^i is locally valid (i.e., has no new v-structure with collider Xi) then 

5 add ft| P a s (G)us to @f 

6 end 

7 end 

8 end 



Lemma 3.1. Let S C sibj(G). Then Gs^i is locally valid if and only 
if there is a DAG Gj in the equivalence class of G such that p&^Gj) = 
paj(G) U 5. 

One direction of this lemma is trivial. If there is a DAG Gj in the equiv- 
alence class of G with p&^Gj) = paj(G) U S, then by definition Gj is locally 
valid and, hence, Gs^i must be locally valid. Surprisingly, the other direc- 
tion also holds, as proved in Section 8. 

Lemma 3.1 directly leads to the following result. 

Theorem 3.2. Oj and are equal when they are interpreted as sets: 

&i = ef, i = i,..., P . 

Theorem 3.2 implies that the only information we lose by using the local 
Algorithm 3 is the multiplicity of the values. The sets of distinct values in 
9f and Si are exactly the same. Implications of this result are that, for 
example, the range of possible causal effects or the minimum absolute value 
of the possible causal effects can be obtained via the local Algorithm 3. 

Remark 3.3. Note that the multiplicities of elements in and 0f 
have different meanings. The multiplicity of an element 9 in Gj corresponds 
to the number of DA Gs in the equivalence class for which the causal effect 
of Xi on Y equals 9. On the other hand, the multiplicity of an element 9' 
in ®f corresponds to the number of subsets S in the local Algorithm 3 that 
yield causal effect 9'. The cardinality of Of is always smaller or equal to the 
cardinality of Gj, since each set S in Algorithm 3 corresponds to at least 
one DAG in the equivalence class (Lemma 3.1). 
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4. Sample versions of the algorithms. Assume that we have a sample 
consisting of n i.i.d. copies of (X\, . . . , X p , Y) = (X±, . . . , X p+ i). We then ob- 
tain sample versions of the algorithms by using the estimated conditional 
dependencies of Xx, . . . ,X P ,Y as input. In the Gaussian case, we use esti- 
mated partial correlations p n ij\s between Xi and Xj given some set of other 
variables S. We then use the sample version of the PC-algorithm to estimate 
the corresponding CPDAG G [14] and [31]. This involves multiple testing 
for Z-transformed partial correlations 

A 1, f 1 + Pnij\s\ 
Z n ij\S= o lo S i -HH- 

Since Z ni j\g has a iV(0, (n — |S| — 3)™ 1 ) distribution if p^\ s = 0, we conclude 
that pij\ s ^ if 

\Z nij \ s \y/n-\S\-3>*- 1 (l - a/2), 

where $ is the standard normal distribution function and < a < 1 is a 
tuning parameter. 

Next, we use the estimated CPDAG G{a) to estimate the multisets of 
possible causal effects, by using sample versions of (4) (i.e., we use the least 
squares estimated regression coefficients). This procedure has been imple- 
mented in the R-package pcalg [15] (in the meantime, code is available from 
the authors). We denote the estimated multisets by 

Q n i(a) for the sample version of the basic Algorithm 1, 

@ ni (a) for the sample version of the local Algorithm 3 

for i = 1, . . . ,p, where we emphasize the dependence of the estimates on the 
tuning parameter a. Possible modifications of Algorithms 1 and 3 that can 
be beneficial in the sample versions of the algorithms are discussed in the 
Appendix. 

4.1. Tuning of the PC-algorithm. The tuning parameter a in the PC- 
algorithm can be chosen via a Bayesian Information Criterion (BIC). First, 
for a given choice of a, we compute the estimated CPDAG G(a). Next, we 
find a DAG G'(a) that is in the equivalence class described by G{a). Based 
on G'(a), we then compute the maximum likelihood estimators ^mle 6'{a) 
and pmle for the covariance matrix and mean vector of the Gaussian dis- 
tribution of Xi, . . . ,X p+ i (cf. [19]). Finally, we choose a to minimize 

-2*(£mle,#(«)>£mle) + lognfe 1 (±mle +p+ A , 

l<3 
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where £(■) denotes the log-likelihood of a (p + 1) -dimensional multivariate 
Gaussian distribution. We point out that the behavior of BIC is still un- 
known in the high-dimensional setting, where the dimensionality p may be 
much larger than the sample size n. 

Another approach to tune the PC-algorithm, is to choose a relatively 
large, so that the resulting graph contains a large number of edges. We 
then investigate which edges (directed or undirected) are stable under a 
subsampling procedure, where stability is measured in terms of the rela- 
tive frequency of occurrence of (directed or undirected) edges under the 
sub-sampling scheme. An edge is kept if the corresponding subsampling 
frequency is larger than a certain cut-off. Surprisingly, this cut-off can be 
determined via controlling a multiple testing error rate. Details of such a 
generic procedure are described in [22]. 

4.2. Incoherences with sample versions. Two types of incoherences may 
occur in the sample version of the PC-algorithm (but the probability of these 
incoherences converges to zero as the sample size n goes to infinity). 

First, the sample version of the PC-algorithm may produce conflicting v- 
structures. For example, the algorithm can produce u-structures X\ — > X2 <— 
A3 and X2 — > A3 <— A4, giving conflicting information about the direction of 
the edge X2 — A3. In such cases, the algorithm overwrites the u-structures in 
the order in which they were tested. Hence, the resulting structure depends 
on the order in which the independence tests are performed. Since we usually 
do not prefer one order of tests over another, we simply choose the structure 
that arises by the ordering of the variables. 

Second, the sample version of the PC-algorithm may produce invalid 
CPDAGs (i.e., CPDAGs that are not extendable). For example, the algo- 
rithm may yield a graph with undirected edges X± — X2 , X2 — A3 , A3 — A4 
and A4 — X\. This is not a valid CPDAG, since it is impossible to direct its 
edges without creating a directed cycle or a ^-structure. In other words, this 
graph does not describe an equivalence class of DAGs. While such an invalid 
CPDAG does not cause problems in the local Algorithm 3, it is problematic 
in the basic Algorithm 1, since in the latter algorithm the CPDAG has to be 
extended in order to find all DAGs in the equivalence class. In Algorithm 1, 
we solve this problem by modifying the estimated CPDAG in the following 
way. First, we search for conflicting u-structures, and we try to rearrange 
them until we get an extendable CPDAG. If this is not possible, we destroy 
as few ^-structures as possible to obtain an extendable CPDAG. If this is 
also not possible, we randomly draw a DAG on the estimated skeleton and 
work with the CPDAG that corresponds to this DAG. 

5. Asymptotic consistency. In this section, we prove asymptotic consis- 
tency of our methods in high-dimensional settings (i.e., in situations where 
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the number of covariates p can be much larger than the sample size n). We 
consider a framework where the model depends on n. We use p n to denote 
the number of covariates, G n to denote the CPDAG, and P n to denote the 
distribution of (X nl , . . . ,X npn ,Y n ) = (X nl , . . . ,X nPn ,X njPn+1 ). We assume 
that the data consist of n i.i.d. copies of (X n \, . . . , X n ^ Pn+ i) ~ P n . Regarding 
P n , we make assumption (A) of Section 3.1. Additionally, we assume the 
following: 

(C) The number of covariates p n = 0(n a ) for some < a < oo. 

(D) The maximum neighborhood size of G n , q n = maxi = i r „ iPn+ i |adjj(G n )|, 
satisfies q n = 0(n l ~ b ) for some < b < 1. 

(E) The partial correlations p n ij\s between X n i and X n j given S satisfy the 
following upper and lower bounds, uniformly over i, j € {1, . . . ,p n + 1} 
and S C {X n i, . . . ,X njPn+ i} \ {X n i,X n j}: 

(6) sup |/Onij|sl ^ M for some M < 1, 

n,i^j,S 

(7) inf 'S\pnij\s\ ■ Pmj\s + 0} > c n , 

where c" 1 = 0{n d ) for some < d < b/2 with b as in (D). 

(F) The conditional variances satisfy the following bound: 

. Vw(X ni \S) 2 

mi — — j— t— — — > v for some v > 0. 
i=i,...,p„,SCadj i (G„) \av{Y n \X ni ,S) 

Assumptions (C)-(E) were also made in [14]. Assumption (C) allows the 
number of covariates to grow as any polynomial of the sample size, represent- 
ing the high-dimensional setting. Assumption (D) is a sparseness assump- 
tion requiring that the maximum neighborhood size in the DAG grows at a 
slower rate than O(n). Condition (6) in assumption (E) excludes (sequences 
of) models in which the partial correlations approach 1, hence avoiding iden- 
tifiability problems. Condition (7) in assumption (E) requires the nonzero 
partial correlations to be outside of the n~ b / 2 range, with b as in assumption 
(D). Note that this condition is similar to Assumption 5 in [21] and condi- 
tion (8) in [36] . Finally, we note that assumption (F) is of the same spirit as 
Assumption 2 in [21]. Namely, if we scale Y n such that Var(y„) = o 1 for all 
n, then assumption (F) is implied by requiring that Var(X n j|S') > v 2 a 2 for 
all i = 1, . .. ,p n and S C adjj(G7 n ). 

Under assumptions (A) and (C)-(E), the PC-algorithm was shown to be 
consistent ([14], Theorem 2). The underlying reason for this result is the hi- 
erarchical nature of estimation and testing of partial correlations within the 
PC-algorithm. Due to sparsity and the faithfulness assumption, there is no 
need to estimate high-order partial correlations. This, together with the fact 
that the error in the estimation of partial correlations decays exponentially 



16 



M. H. MAATHUIS, M. KALISCH AND P. BUHLMANN 



fast with increasing sample size, form the key elements of the consistency 
proof for the underlying CPDAG. 

Consistency of the PC-algorithm means that there is a sequence a n such 
that P(G n {otn) = G n ) — > 1 as n — > oo. By combining this with the fact that 
for any given valid CPDAG the sample versions of Algorithms 1 and 3 per- 
form exactly the same linear regressions, the following result is immediate. 

Theorem 5.1. Under assumptions (A) and (C)-(E), there is a sequence 
a n such that for all n>\ the following holds on sets A n with P{A n ) — > 1: 

&ni(a n ) = &ni(a n ) for alli = l,... ,p. 

The next theorem shows that G n j and G^ are consistent estimators for 
Q n i and G^, respectively. 

Theorem 5.2. Under assumptions (A) and (C)-(F), there exists a se- 
quence a n such that 

SUp (^multiset (©ni («n) , ©m) ~>p °> 
i=l,...,p n 

SUp d mu iti S et (@ni ( a n) > ®ni) °) 
i=l,...,p n 

where, for any two multisets A = {a\, . . . , a m } and B = {bi, . . . , b q } with or- 
der statistics am < • • • < ai m \ and bm < • • • < b^, 

A ( A T3\ \ ■ ^ l°0-)- 6 (j)l' */™ = <?> 

"multiset (A = < J =1 v,m 

too, ifmjtq. 

The proof of Theorem 5.2 is given in Section 8. The key elements of the 
proof are similar to the ones in the consistency proof of the PC-algorithm. 
We only need to perform a limited number of low-order regression problems, 
and the estimation error we make in such problems decays exponentially fast 
when the sample size increases. 

Figure 3 illustrates the connections between Theorems 3.2, 5.1 and 5.2. 
In particular, combining Theorems 3.2 and 5.2 yields that the elements of 
G^j converge in probability to elements of G n j, uniformly over the elements 
in G^j and i = l,...,p n . Moreover, every element of Q n j is reached in this 
way. This leads to the following corollary. 

Corollary 5.3. Let /:R — >M be a continuous function. Then, under 
assumptions (A) and (C)-(F), 

sup | min{/(0) : 8 G G^J - min{/(0) : 9 G Q M }| 0. 

i=l,...,p n 
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in 



in 



(as multisets) 
6& = Qui (onA n ) 



Fig. 3. Illustration of the connections between B„ i; ©ni, O^i ffl^rf O n i, given by Theorems 
3.2, 5.1 and 5.2. 



An important implication of this corollary is obtained by taking f(x) = 
\x\, yielding that, under assumptions (A) and (C)-(F), 

(8) sup |min{|0|:0ee£j-min{|0|:0ee ni }|-> p O. 

i=l,...,p n 

The minimum absolute value of n j is a lower bound on the size of the 
causal effect of X; on Y. Equation (8) implies that we can estimate this 
bound consistently via the local method, uniformly in i = 1, . . . ,p n . 

Another implication of Corollary 5.3 follows by taking f(x) = x and 
f(x) = —x, yielding that the local method is consistent for the joint esti- 
mation of (min(0 n j),max(0 n j)) = (mm{8:8 S n j},max{0 :8 S n i}), uni- 
formly in i = l,...,p n . Hence, any continuous function g:M? — > R of 
(min(0 n j),max(0 n j)) can be consistently estimated by the local method. 
In particular, taking g(x,y) = y — x, we obtain that under assumptions (A) 
and (C)-(F) 

sup | range(0^) - range(0 ni )| 0. 

i=l,...,p n 

Thus, the range of possible causal effects of X$ on Y can be consistently 
estimated by the local method, uniformly in i= 1, . . . ,p n . 

We close this section by pointing out that not all functions of n j can 
be consistently estimated by the local method. For example, the mean of 
0^ is typically not a consistent estimate of the mean of n j, since the 
multiplicities of n , and 0^ have different meanings (see Remark 3.3). In 
our simulations, however, the local method still yielded surprisingly good 
results in such a setting (see Figure 4, left panel). 

6. Simulations and real data analysis. We now demonstrate the behavior 
of our methods in simulation studies and on a real data set. First, in Section 
6.1, we use simulation studies to examine the behavior and speed of the 
basic method (Algorithm 1) and the local method (Algorithm 3). Next, in 
Section 6.2, we apply our methods to the problem of riboflavin production 
by B. subtilis that was discussed in the Introduction. 
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Fig. 4. Comparison of the basic method (B) and the local method (L) over settings 1(a), 
1(b) and 2. The left panel shows boxplots for e^ie' an d the right panel shows boxplots for 
, k = 1, . . . , rireps (outliers excluded). The combination of the algorithm (B/L) and the 
simulation setting [1 (a)/l(b)/2/ is indicated on the x-axis. 

6.1. Simulation studies. We use the following simulation scheme. We 
generate n reps i.i.d. DAGs with edge weights for the following two settings: 

Setting 1: p + 1 = 10, en = 4, ra reps = 1000, 

Setting 2: p + 1 = 1000, en = 4 (block structure), n rcps = 100, 

where p + 1 is the number of vertices of the DAG and en is the expected 
neighborhood size of the DAG. The simulation of a single DAG with edge 
weights proceeds as follows. First, we use the R-package pcalg [15] to simu- 
late a random DAG on X\, . . . ,X p +\ with the pre-specified expected neigh- 
borhood size en. In setting 2, we enforce a special block structure on the 
DAG by letting it consist of 100 disconnected components (blocks) of 10 
variables each. Subsequently, we equip all edges Xi <— Xj with edge weights 
(3ij which are drawn independently from a Uniform([l, 2]) distribution. 

For each k = l,...,n reps in the two settings, the DAG with edge 
weights /3ij defines an underlying distribution on (x[ k \ . . . , X^^): 

let ei,...,£p + i, i.i.d. ~AA(0,1), 

(9) 

for i = l,...,p + l, setxl k) = Yl P\fxf>+ei. 

[Note that the x\ k)l s can be defined recursively as in (9), since pcalg 
automatically orders the variables in the DAGs so that pa(Ai) = and 
P&j C {Xi,...,AVi} for i = 2,...,p+l.] 

For each DAG G^ k \ we randomly choose one vertex as the response vari- 
able Y^ k > and another vertex as the covariate of interest X^ k \ We then 
determine the true multiset of possible causal effects of X^ on based 
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on the true underlying distribution of (X[ , . . . , -Xp+i ) , and denote this by 
9( fc ). In setting 2, and are randomly chosen from the same block, 
in order to allow for a more direct and fair comparison with setting 1. [If 
and F W were chosen from different blocks, then the causal effect could 
be quite easily identified as zero, giving an unfair advantage to setting 2.] 

For each DAG G^ k \ we simulate a data set consisting of n i.i.d. copies 
of (x[ k ' , . . . , Xp^ ) . We use two different sample sizes for setting 1 , and one 
sample size for setting 2: 

Setting 1: n = 20 [setting 1(a)] and n = 2000 [setting 1(b)], 

Setting 2: n = 100. 

Based on these simulated data, we compute estimates of Q^ k \ using tuning 
parameter a = 0.01 in the PC-algorithm. In settings 1(a) and 1(b), we use 
both the basic and the local algorithm. In setting 2, we only use the local 
algorithm, since the basic algorithm is infeasible. We denote the output of 
the basic algorithm by and the output of the local algorithm by Q( k ' L \ 
We compare 

g)(fc) to 0(fc) 

using the following two measures: 




= (min{|0| : § G 6 (fc) } - min{|0| : 9 G 9 (fc) }) 2 
with analogous measures for comparing 

Q(k,L) tQ e (fc) i Note that e 2(*0 

mea- 
sures the squared error in the estimation of the mean absolute value of Q^ k \ 
and e^j^ measures the squared error in the estimation of the minimum 
absolute value of Q^ k \ 

Figure 4 compares the results of the basic method and the local method, 
showing boxplots for e^ ve (left panel) and e^ in (right panel). From the dis- 
cussion following Corollary 5.3, we know that the local method is consistent 
for the minimum absolute value of @( k \ while it is typically inconsistent for 
the mean absolute value of Q^ k \ On the other hand, the basic method is 
consistent for both parameters. In light of this, it is surprising to see that 
the boxplots for the basic method and the local method are basically iden- 
tical for both measures of performance e^ ve and e^ in . We also note that 
both methods perform better in setting 1(b) than in setting 1(a) because of 
the larger sample size in setting 1(b). Finally, the performance of the local 
method deteriorates only slightly in the high-dimensional setting 2. 

In order to demonstrate the behavior of the basic method and the local 
method in more detail, we also evaluate their performance on several data 
sets that are generated from a fixed DAG with edge weights. Thus, we 
generate a random DAG G (p = 7,en = 3) with edge weights, and randomly 
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Estimated Coefficient Estimated Coefficient 



Fig. 5. The estimated effects (density plots for the output of the basic and the local 
method over 50 replicates) are compared to the true multisets of possible causal effects 
(vertical lines; heights indicate the relative frequencies of the values). The parameters in 
all four settings are p = 7, en — 3, n= 1000, a — 0.01. 

choose a covariate X and a response variable Y, as before. Next, we generate 
50 data sets of size 1000 from this DAG, according to the model given in (9). 
For each data set, we estimate the multiset of possible causal effects, using 
a = 0.01. We then aggregate these 50 estimates, and construct a density 
plot. 

Figure 5 shows the results for four typical DAGs. The true multisets of 
possible causal effects are indicated by vertical lines, where the height of 
each line indicates the relative frequency of the given value in the multiset. 
In the upper left panel, we see that both methods pick up the set of possible 
causal effects quite reliably. The basic method captures the multiplicities 
better than the local method, as expected from our theory (see Remark 3.3). 
However, this advantage of the basic method is not so clear in the upper 
right panel. The lower left panel shows an example where the true causal 
effect is zero, and this is identified correctly by both methods. Finally, the 
lower right panel shows an example where the true causal effect is unique and 
is approximately 0.63. Both methods find this effect, but they also identify 
zero as a possible causal effect. This error is caused by the fact that the 
CPDAG is estimated incorrectly for some of the 50 data sets. 

Finally, we consider the runtime of the algorithms. Table 1 shows that 
the runtime of the basic algorithm is much larger and much more volatile 
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Table 1 

Mean runtime in seconds of the basic algorithm and the local algorithm over 10 replicates 

with settings en — 3, n= 1000, a = 0.01, and the specified number of covariates p. 
Standard errors of the mean are given in parentheses. A value NA means that at least one 
of the 10 replicates took more than 48 hours to compute, so that the computation was 
aborted. All computations were carried out on a 2.6 GHz Dual-Core AMD Opteron 
processor with 32 GB RAM on red hat Linux 2.6.18, using R 2.7.2 



p = 4 


p = 9 


p = 14 


p = 29 


p = 49 


p = 99 


Basic 0.120 (0.01) 


17.6 (5.4) 


NA 


NA 


NA 


NA 


Local 0.038 (0.002) 


0.088 (0.008) 


0.15 (0.02) 


0.50 (0.06) 


0.99 (0.06) 


2.8 (0.3) 



than the runtime of the local algorithm. This was to be expected, since 
the basic algorithm has to find all DAGs within an equivalence class. In 
our implementation, graphs with 15 vertices or more cannot be handled 
reliably by the basic algorithm, while they can be handled easily by the 
local algorithm. 

6.2. Riboflavin data. We now apply our methods to a data set about 
riboflavin (vitamin B2) production by B. subtilis, kindly provided to us by 
DSM Nutritional Products (Switzerland). As discussed in the Introduction, 
the data are observational. The real-valued response variable is the loga- 
rithm of the riboflavin production rate, and there are p = 4088 covariates 
measuring the logarithm of the expression level of 4088 genes that cover 
essentially the whole genome of Bacillus subtilis. The sample size is n = 71, 
and, hence, this is a high-dimensional setting with p^> n. 

The data are of high quality, for example, in terms of a large signal to 
noise ratio in a properly regularized linear model. Furthermore, Gaussianity 
of the marginal distributions of the data seems a reasonable approximation. 
Detecting strong deviations from joint multivariate Gaussianity in such high- 
dimensional data is extremely hard, as is verification of the DAG and faith- 
fulness assumptions. A more detailed discussion about these assumptions 
can be found in Section 7. 

Due to the large number of covariates in this data set, our basic algorithm 
is infeasible, and we only apply the local algorithm. After standardizing the 
data so that all covariates have unit variance, we estimate the multiset of 
possible causal effects of each gene on the riboflavin production. We first 
analyze the number of distinct values in each of these multisets, which we 
call the ambiguity of the multiset. In high-dimensional problems, one might 
fear that these ambiguities can be very large, but this is not the case for 
the riboflavin data. Varying the tuning parameter a for the PC-algorithm 
between 0.01 and 0.5, there is no gene in the pool of 4088 genes with an 



22 



M. H. MAATHUIS, M. KALISCH AND P. BUHLMANN 



Table 2 



The fraction 


of the 4088 genes 


in the ribc 


flavin data set with a certain ambiguity 


a, for 




various 


values of the tuning parameter 


a 






a = 1 


a = 2 


a = 3 


a = 4 


a = 5 


a = 0.01 


0.775 


0.186 


0.036 


0.004 


0.001 


a = 0.05 


0.845 


0.120 


0.029 


0.005 


0.001 


a = 0.1 


0.897 


0.085 


0.016 


0.002 





a = 0.2 


0.951 


0.042 


0.005 


0.002 





a = 0.3 


0.970 


0.025 


0.003 


0.002 





a = 0.4 


0.974 


0.023 


0.002 


0.001 





a = 0.5 


0.981 


0.018 


0.001 









ambiguity greater than 5, and the large majority of genes have ambiguity 1 
(i.e., they yield a unique estimate for the causal effect; see Table 2). 

In the remainder of the analysis, we set the tuning parameter a to 0.01. 
In order to obtain a single estimate for the causal effect of each gene, we 
compute the minimum absolute value of its estimated multiset. As discussed 
before, this is a consistent estimate for the minimum absolute value of the 
true multiset of possible causal effects of the gene (under our assumptions) . 
In order to assess the reliability of these estimates, we bootstrap the data 
10 times and take the median of the 10 estimates for each gene. We call the 
resulting values the causal scores of the genes. Figure 6 shows a histogram of 
these causal scores. Note that the histogram has a strong right tail, indicat- 




FlG. 6. Histogram of the causal scores (median of the minimum absolute effect over 10 
bootstrap samples) for the 4088 genes in the riboflavin data set. All genes to the right of 
the vertical line have a local FDR of less than 10%. 
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ing that there is a group of genes with strongly estimated causal effects that 
are stable in a bootstrap analysis. In order to decide which causal scores 
should be considered "significantly high," we use the local false discovery 
rate (FDR) [7]. The vertical line in Figure 6 shows the cut-off for a local 
FDR of 10%. About 200 of the 4088 genes fall to the right of this cut-off, and 
hence have a local FDR that is less than 10%. According to our analysis, 
these genes are promising candidates for genetic modification. 

We compare our method to an association approach using regression, 
which is, as we have argued before, inappropriate for inferring causal effects. 
To cope with high-dimensional variable selection in a linear model, we use 
the (prediction optimal tuned) Lasso; properties of the Lasso for variable 
selection in regression are discussed in [21] and [36]. Among the top ten genes 
of Lasso (ordered by absolute values of estimated regression coefficients), we 
found only one gene that was also among the top ten genes of our method 
(ordered by the causal scores). This difference is due to the fact that causal 
effects and associations can be very different. To evaluate whether the results 
of our intervention approach are superior in practice, we would need to 
compare the intervention effects that we computed from the observational 
data to intervention effects computed from lab experiments in which the 
interventions were actually carried out. We are currently in the process of 
doing this. However, already at this point we can say that if the target is 
prediction of intervention effects or causal effects, an association analysis 
like regression will not provide an appropriate answer. 

7. Discussion. In this paper, we present a new method that combines 
estimation of the equivalence class of DAGs with causal inference methods 
that can be used when the DAG is known. The need for such a combination 
is due to the fact that, for a large class of practical problems, it is unrealistic 
to assume that the graph structure among the variables of interest is known. 
Thus, we assume that we have observational data that were generated from 
an unknown DAG, and based on these data we want to estimate causal 
effects. We argue that, in this situation, causal effects can typically not be 
uniquely determined, and we focus our estimation on the multisets of 
possible causal effects of Xi on Y, i = 1, . . . ,p. Summary measures of Oj 
can be used to determine variable importance. In particular, we propose 
using the minimum absolute value of Oj, since this is a lower bound on the 
size of the causal effect of Xi on Y . We show that the distinct values of Oj 
can be estimated by a fast local method, which is computationally feasible 
and asymptotically consistent in sparse high-dimensional settings. Thus, we 
achieve consistent estimation, based on observational data, for the lower 
bound of the size of each individual covariate's total causal effect on Y. 

The motivation for our work comes from a problem about genetic en- 
gineering of Bacillus subtilis in order to improve its riboflavin production 
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rate. The response variable of interest is the riboflavin production rate, and 
there are p = 4088 covariates (genes) from which we have expression lev- 
els. Based on these observational data, our goal is to find genes that are 
good candidates for single-gene interventions that improve the riboflavin 
production rate. With our new method, we find a list of genes whose top- 
ranking members are surprisingly stable (when doing a bootstrap analysis) 
and clearly relevant in terms of a local false discovery rate. Furthermore, our 
list of genes with large lower bounds for their causal effects is markedly dif- 
ferent from a regression approach, which measures only association (instead 
of intervention or causality). 

One should be careful not to over-interpreting our results. We have shown 
that, within the class of Gaussian distributions that are faithful to the true 
(unknown) causal DAG, it is possible to estimate good lower bounds for 
causal effects on the basis of observational data. In practice, it is hard or 
impossible to check whether our assumptions hold, at least in an approxi- 
mate sense. 

The Gaussian assumption is conceptually not a key assumption. For non- 
Gaussian data, the PC-algorithm can still be used to estimate the equiva- 
lence class of DAGs, and the causal effects are still given by (3) [but they will 
not be constant in general, and depend on the value of x\ in (3)]. It is also 
interesting that in a certain sense the Gaussian assumption makes things 
more difficult. If the linearity assumption is retained and the errors are as- 
sumed to be non-Gaussian, then the DAG can be uniquely recovered [29], 
preventing the need to work with equivalence classes. We note, however, that 
Gaussianity is essential for our consistency proofs of the algorithms in high- 
dimensional settings. Consistency of the PC-algorithm for high-dimensional 
non-Gaussian data is still an open problem, and consistent estimation of the 
causal effects also seems nontrivial in general, involving function estimation. 

The DAG assumption implicitly includes the assumption that we have 
no unmeasured confounders. This is a very strong assumption in general. 
In our particular example from biology, this assumption may be reasonable 
though, since we observe the expression levels from essentially all genes of the 
Bacillus subtilis genome (but of course there may still be some unobserved 
aspects of the genome). The presence of hidden variables can lead to various 
problems. First, since the class of graphical Markov models is not closed 
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Fig. 7. Examples illustrating that our approach can lead to erroneous conclusions in the 
presence of hidden variables. 
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under marginalization [26], it can happen that there is no CPDAG that 
represents all and only all conditional independence relationships among 
the observed variables. This causes problems in the PC-algorithm with con- 
flicting u-structures, even for infinite sample sizes. Hence, the presence of 
conflicting -y-structures for data sets where n is large relative to p may be 
interpreted as a warning sign for the presence of hidden variables. But even 
without conflicting v -structures, hidden variables can lead to erroneous con- 
clusions. To see this, suppose that data are generated according to the DAG 
in Figure 7(a), and that variable L\ is not observed. Then, the causal effect 
of X\ on Y is not identifiable, because any part of the observed correlation 
between X\ and Y can be explained by L\. However, applying the PC- 
algorithm to the distribution of the observed variables (Xi , Y) yields the 
CPDAG X\ — Y, and our approach wrongly concludes that the causal effect 
of X\ on Y is contained in the set {O,@i\ }. As a second example, suppose 
that data are generated according to the DAG in Figure 7(b) and that vari- 
ables L\ and L2 are not observed. From the DAG, it is clear that neither 
X\ nor X2 have a causal effect on Y. But the distribution of the observed 
variables (Xi,X2,Y) is faithful to the DAG X\ — > Y <— X<i- Hence, applying 
the PC-algorithm to the distribution of the observed variables yields this 
DAG, and our approach wrongly concludes that the causal effects of X\ 
on Y and X2 on Y are unique and equal to Px\ss an d /?2|0> respectively. 
The problem in these two examples is that, although the CPDAGs X\ — Y 
and X\ — > Y <— X2 represent all and only all conditional independence rela- 
tionships among the observed variables, these CPDAGs may no longer be 
interpreted causally. Relaxing the assumption of unmeasured confounders is 
possible by extending our methodology to ancestral graphs (see [26, 27] and 
[34]) which allow for hidden variables. However, deriving bounds for causal 
effects when the underlying ancestral graph is unknown is an open issue. 

Another interesting direction of future research consists of considering 
other types of causal effects than the total effect of a single covariate on 
a response, such as the direct effect of a single covariate on a response or 
the total effect of a joint intervention on several covariates. The conceptual 
idea of our basic algorithm (Algorithm 1) can be used for any type of causal 
effect. But our local algorithm (Algorithm 3) relies on the fact that the total 
effect of a covariate X on a response Y can be computed if one knows the 
parents of X. Such a simple relationship does not hold in general for other 
types of causal effects, and it would be interesting to investigate for which 
other types of causal effects a local approach like the one in Algorithm 3 
could be used. 

We conclude by coming back to our practical problem of riboflavin pro- 
duction by Bacillus subtilis. This problem is of a causal or interventional 
type, and, hence, our intervention approach is more appropriate than a 



26 



M. H. MAATHUIS, M. KALISCH AND P. BUHLMANN 



regression-type association analysis using high-dimensional variable selec- 
tion in a linear model. Therefore, despite open issues in the difficult field of 
inferring bounds for causal effects, our new approach offers both conceptual 
and practical improvements. 

8. Proofs. In order to prove Lemma 3.1, we need some more graph the- 
ory and terminology. Consider an undirected graph G = (V,E). For any 
subset V' C V, the subgraph induced by V' is Gy = (V',Eyi), where Eyi 
is the set of all edges in E whose endpoints are both in V'. G is called 
chordal (or triangulated) if each of its cycles of length four or more has a 
chord, which is an edge joining two nonconsecutive vertices in the cycle. G 
is called complete if every pair of distinct vertices is adjacent. A clique is a 
set of vertices so that every pair of distinct vertices in this set is adjacent. A 
vertex is simplicial if its adjacency set forms a clique. A perfect elimination 
scheme of a graph G is an ordering a = (v\, . .. ,v n ) of its vertices, so that 
each Vi is a simplicial vertex in the induced subgraph Gi Vu _ ^ Vn \. 

Chordal graphs have many nice properties. We will use the following (cf. 
[1, 6] and [9]): 

1. Every chordal graph G has a simplicial vertex. If G is not complete, then 
it has at least two nonadjacent simplicial vertices. 

2. Chordality of graphs is a hereditary property: if G = (V,E) is chordal, 
then all subgraphs of G induced by V' C V are chordal. 

3. Every chordal graph has a perfect elimination scheme. 

We also note that we can turn an undirected graph into a DAG with- 
out -y-structures by directing its edges according to a perfect elimination 
scheme a = (v\, . . . ,v n ): for any vertex Vi, determine the adjacency set of vi 
in Gi v . Vn \, and for each vertex Vj in this adjacency set, direct the edge 
Vj-Vi toward v^. Note that the ordering of the vertices ensures that we can- 
not create directed cycles. Moreover, we cannot create u-structures, since 
the adjacency set of V{ in Gi v . Vn \ is a clique for alH = 1, . . . , n. 

Proof of Lemma 3.1. Let i £ {1, . . . ,p}, and let S C sibj(G). We only 
prove the nontrivial direction of the lemma. We assume that Gs—>i is locally 
valid, and we show that there is a corresponding DAG G* in the equivalence 
class with pa^G*) = pa^G) U S. 

First, we note that Xj U S is a clique. This is trivial if S = 0. If S ^ 0, take 
an arbitrary vertex v in S. Since S C sibj(G), v is adjacent to JQ. It must 
also be adjacent to all other vertices in S, since, otherwise, Gs^i contains 
a new -y-structure with Xj as collider, and this contradicts the assumption 
that Gs^i is locally valid. 

Next, we use the following facts that were proved in [20] , Proof of Theorem 
3: (i) no orientation of the edges not oriented in G will create a directed cycle 
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which includes an edge or edges that were oriented in G, (ii) no orientation of 
an edge not directed in G can create a new u-structure with an edge that was 
oriented in G and (iii) the subgraph G' of G, obtained by removing all of the 
oriented edges in 67, is the union of disjoint chordal graphs. Combining these 
facts implies that any orientation of the edges in G' that does not create 
directed cycles or f-structures corresponds to a DAG in the equivalence class 
of G. Moreover, in order to find such an orientation, each of the disjoint 
chordal graphs in G' can be considered separately. 

Let G'{, . . . , G" d be the collection of disjoint chordal graphs constituting G' . 
Without loss of generality, we assume that Xi is contained in G'{. Since 
G'2 , • • • , G'd are chordal, we can find a perfect elimination scheme for each of 
these graphs and order their edges accordingly. We need to be more careful 
with G'{, since we need to find a direction of the edges so that all and 
only all vertices in S are parents of Xi. In terms of a perfect elimination 
scheme, this means that we need to order the vertices V{' of G'{, such that 
all vertices in V" \ {{Xi U S}) are ordered before Xi, and all vertices in 
S are ordered after Xj. If G'[ is complete, then such an ordering exists 
trivially, since any ordering of the vertices of a complete graph is a perfect 
elimination scheme. If G'{ is not complete, then there must be at least two 
nonadjacent simplicial vertices. Since {Xi} U S is a clique, at least one of 
these vertices must be in V" \ {{Xi U S}). We take such a vertex, say v±, 
as the first vertex in the perfect elimination scheme. Next, we consider the 
induced subgraph Gy/>\{ vi y. This graph is again chordal, since chordality 
is a hereditary property. If Gyn\i vi \ is complete, then we are done. If it is 
not complete, then we take a simplicial vertex in {V" \ {v±}) \ {{Xi} U S) 
as the next vertex in the elimination scheme. We repeat this procedure 
until it terminates, which is guaranteed to happen for some graph Ga with 
A D ({X} U S), since {Xi} U S is a clique. 

Directing the edges of G'{, . . . , G d according to the resulting perfect elimi- 
nation schemes yields a DAG without u-structures and with the same skele- 
ton as G' , where all and only all vertices in S are parents of Xi. Hence, using 
this direction of edges in the original CPDAG G yields a DAG G7* that is in 
the equivalence class of G and that satisfies paj(G7*) = paj(G) US. □ 

In order to prove Theorem 5.2, we need the following lemma. 

Lemma 8.1. Assume that assumptions (A) and (C)-(F) hold. Then, for 
every e > 0, we have 

SUP P{\Pni\S ~ Pni\s\ >£) 

i=l,...,p n ,SCadj i (G„) 

c 

< — exp(-C7 2 e 2 (?i - q n - 1)) + 2exp(-C 3 (n/2 - q n - 1)), n > N, 
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where N > is a constant depending on q n [see assumption (D)y, C\^C% ^> 
are constants depending on v [see assumption (F) J, and C3 > is an absolute 
constant. 



PROOF. Let i 6 {1, . . . ,p n }, S C adj i (G n ), and e > 0. If Y n G 5, then 
Pni\S = @ni\S = 0- Hence, we assume Y n ^ 5. In that case f3 ni \ s is the esti- 
mated regression coefficient of X n i in the regression of Y n on X n i and 5, 
and [3 n i\s is the true regression coefficient of X ni in the regression of Y n on 
X n i and 5. 

We first analyze the distribution of $ n i\s\{^m, S}. Let cr 2 ^ 5 denote the 
variance of Y n \{X n i, S}, and let cr^ s denote the variance of X n i\S. More- 
over, let s ni denote the sample variance of X ni [using (n — 1) in the denom- 
inator], let s^j|5 denote the sample variance of X n i\S (using the residuals 

in the regression of X n i on 5, with n — \S\ — 1 in the denominator), and let 
^ni\s denote the sample multiple correlation coefficient between X n i and S. 
Then, 

(10) Vav0 nils \{X ni ,S})- 1 aly ^ S ^ 



l-^ |s (n-l)^ (n-|S|-l)*V 

where the first equality is a well-known identity, and the second equality 
follows from 1 - = {{n - \S\ - l)s 2 nils }/{(n - 1)4 J. Combining (10) 

with E((3 ni \s\{X n i, S}) = P n i\s and assumption (A), we obtain 



(11) P(\$ ni]s - (3 ni{s \ > e\{X ni ,S}) = p(\Z\ > |S| {X ni ,S} 

V a ny\i,S 

where Z is a standard normal random variable. 

We first analyze (11) on the event B niS = {X ni , S : s^s > \ a ni\s^- Using 
assumption (F), we obtain 



ey/n- S - ls nils 



ny\i,S 



{X n i, S} ) 1b 



<P[\Z\>e v y^ 



V2 



<P(\Z\>Cey/n-q n -l), 

where C depends on v, and q n is as in assumption (D). Using the well-known 
bound on tail probabilities of the standard normal distribution -P(|^| > a) < 
2/(\/27ra) exp(— a 2 /2), the last display is bounded above by 

— exp(-C 2 e 2 (n - q n - 1)) 



INTERVENTION EFFECTS FROM OBSERVATIONAL DATA 29 

for all n > q n + 2, where C\, C2 > are constants depending on v. 
Next, we compute an upper bound for P(B^ iS ). Note that 

p ( b£ s | S ) = p ( '"-' s ' 2 - 1 '^ < 1 ( „ _ | S | _ 1)]s 

^ "ni I S 

= P(xL\S\-i<(n-\S\-l)/2\S) 

<P(£- qn -i<{n-l)/2), 

where x\ denotes a chi-squared random variable with k degrees of freedom. 
We now apply Bernstein's inequality ([32], Lemma 2.2.11, page 103) by 
writing 

P(xl- qn -i <(n- l)/2) = P(xl- qn -i ~ (n - q n - 1) < -(n - l)/2 + g„) 

< ^(IXn-,„-i - (n " 9n - 1)1 > (" - l)/2 - g„). 
By viewing a Xn_ gn _i — (n — q n — 1) random variable as the sum of n — g n — 1 
independent centered Xi random variables and noting that a centered \\ 
random variable satisfies the moment conditions of Bernstein's inequality, it 
follows that the last display is bounded above by 

where C'^,C 4 > are constants coming from the moment conditions. This 
expression is in turn bounded above by 2exp(— C^{n/2 — q n — 1)) for all n 
such that (n — l)/2 — q n > C' 3 . Since this bound holds for all S with \S\ <q n , 
it also holds for the unconditional probability P(B^ iS ). 
The result now follows from putting the parts together: 

P(\Pni\S ~ Pni\s\ >£) 

< / P(\P m \s-Pm\s\>e\{X nu S})l BntS dF x _s + P(Bg s ) 

< Q exp(-C7 2 e 2 (n -q n - 1)) + 2exp(-C 3 (n/2 - q n - 1)), 
where Fx ni ,S denotes the distribution of (X ni ,S). □ 

PROOF of Theorem 5.2. Let e > 0. By consistency of the PC-algorithm 
([14], Theorem 2), there is a sequence a n , such that P(A n ) — > 1 for the event 
A n = {G n (a n ) = G n }. Hence, it is sufficient to show that 

(12) lim P[ sup d mu itiset(0m(Q!n),@m) > £,A n ) -> and 

(13) lim P( sup d m ultiset(©^(«n),®ni) > e > ) ^ °- 
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In the remainder of the proof, we suppress the dependence of a n in the nota- 
tion. We first consider the local method. On the event A n , the cardinalities 
an d I ©nil OI " the multisets 0^ and 6^ are equal. Hence, 



(14) 



P ( SUp (i mu i t i sc t (6^, Q^) > 6, A, 
\=l,...,p n 



P( sup \§% m -e% m \>e,A r 

i=i,..., P „j=i,...,|e^l 



where @ni(j) an< ^ ®ni(j) are ^ ne or der statistics of 0^ and G^,, respec- 
tively. Moreover, on the event A n , we have that for every i = l,...,p n 
and j = 1,...,|Q^|, 9^ i{j) = $ ni \ pa .(G n )uS for some s S sib i (G n ). Hence, 

^niO) = Pni\S> for some 5" ^ adjj(G n ). Note, however, that and 0^) 

do not need to correspond to the same set S, since it may happen that 
®ni(j) = Pni\S, °ni(j) = Pni\S r i and Pni\S + Pni\s>- But > since the pairing of the 
elements of 0^ and G^, with respect to their order statistics, is an opti- 
mal pairing for the supremum distance, the following inequality holds for all 
i = l,...,p n : 

SU P \Qni(j)- e ni(j)\< SU P \Pni\S ~ Pni\sV 
i=l,...,|e^| SCadj^Gn) 

Combining this with (14) yields 

P( SUp d m ultisct(0nji ®ni) > £ ' 
\=l,...,p n 



(15) 



<P[ Slip \Pni\S - Pni\s\ >£ 

«=!,.. ..p^SCadj^Gn) 



Pn 



<E E ^(lA»|s-/9ni|sl>e) 

^lSCadj^Gn) 

<p n 2 q " sup P(|/9 ni |s - >e), 

i=l,...,p„,5Cadjj(G„) 

where the last inequality follows from the fact that the number of possible 
subsets of adjj(G n ) is bounded above by 2 9n , where q n is given in assump- 
tion (D). Using Lemma 8.1 and assumptions (C) and (D), it follows that 
expression (15) converges to zero as n — > oo. This completes the proof of 
(13), yielding consistency of the local method. 

We can use the same reasoning for the basic method. To see this, we note 
that on the event A n the estimated CPDAG is a valid CPDAG. Hence, the 
sample versions of the basic and the local algorithm perform exactly the 
same linear regressions (cf. Theorem 5.1). The only difference in the output 
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of the two algorithms lies in the multiplicities of the values. But since the 
estimated CPDAG is correct, the multiplicities of the sample version of the 
basic algorithm are correct, and they do not affect expression (12). □ 

APPENDIX: POSSIBLE MODIFICATIONS OF THE ALGORITHMS 

We first introduce some new notation. Let pa i y (G) be the set of vertices 
in pa,(G) from which there is a path to Y. Similarly, let sibj )2/ (G) be the set 
of vertices in sibj(Cr) from which there is a path to Y. 

We now discuss two modifications that can be made to the basic Algo- 
rithm 1: 

(i) Replace line 4 of Algorithm 1 by: "If Gj does not contain a directed 
path from Xj to Y, then set 0y = 0. Otherwise, set 6ij = /3j| pa .(G.,)-" Since 
the causal effect of Xj on Y is by definition zero if there is no directed 
path from Xj to Y , this modification does not change the output of the 
population version of the algorithm. In the sample version, however, it allows 
us to estimate exact zeroes, eliminating estimation error from the regression 
estimates when there is no directed path. 

(ii) Replace pa^Gj) in line 4 of Algorithm 1 by pa i (Gj). Since both 
pa i (G J ) and pa i y (Gj) satisfy the back-door criterion with respect to (Xj, Y), 
the output of the population version of the algorithm is again unchanged. 
In the sample version, this modification can be used to reduce the dimen- 
sionality of the regression problems. 

One can also make several modifications to the local Algorithm 3: 

(i) Before line 2 of Algorithm 3, test whether the CPDAG G allows a 
directed path from Xj to Y (i.e., test whether it is possible to direct the 
undirected edges of G so that a directed path from Xj to Y is created, 
without creating additional ^-structures or directed cycles). If G does not 
allow such a path, set Oj = {0}. If G does allow such a path, perform lines 2- 
7 of Algorithm 3. This modification may change the output of the population 
version of the algorithm, in the sense that the cardinality of Oj may change 
if G does not allow a directed path. In such the cardinality is always 
1 in the modified version, while it equals the number of sets S for which 
Gs-ti is locally valid in the original version. However, the distinct values 
in Oj do not change. In the sample version, this modification is useful for 
the same reason as modification (i) of Algorithm 1. It allows us to estimate 
exact zeroes, without estimation error from the regression problems. 

(ii) Replace sibj(G) in line 3 of Algorithm 3 by sibj i2/ (G). This substitu- 
tion may again change the multiplicities of values in Oj, but not the distinct 
values. This modification can be beneficial for two reasons. First, the algo- 
rithm may become faster, since one has to consider fewer subsets S in line 
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3 of Algorithm 3. Second, the dimensionality of the regression problems is 
reduced. 

(hi) Replace pa i (G) in line 5 of Algorithm 3 by pa iy (G). This can be 
done for the same reasons as modification (ii) of Algorithm 1. 

In both Algorithm 1 and Algorithm 3, one could also first determine the 
maximum likelihood estimator (MLE) for the estimated CPDAG, and then 
use this to compute the intervention effects. 

We note that all modifications mentioned above are nonlocal, in the sense 
that they require more information about the CPDAG than the neighbor- 
hood of Xi, for example, about paths between Xi and Y [modification (i)- 
(iii) for Algorithm 3], or about the entire CPDAG (when using the MLE). 
Whether such nonlocal modifications are improvements over the local Al- 
gorithm 3 depends on the situation. When n is large relative to p, and, 
hence, the CPDAG is estimated well, then the nonlocal modifications tend 
to behave better than Algorithm 3. On the other hand, when n is not so 
large relative to p, and the CPDAG is not estimated with very high accu- 
racy, then the nonlocal modifications tend to behave worse than Algorithm 
3. The reason for the latter is that in such situations, the nonlocal mod- 
ifications are more likely to use incorrect information from the estimated 
CPDAG than the local methods. For example, consider the situation that 
the true CPDAG contains one directed path from Xi to Y, with Xi and Y 
far apart from each other. Now, if the direction of one edge on this path is 
reversed in the estimated CPDAG (and no other errors are made that cre- 
ate another directed path between Xi and Y), then modification (i) would 
wrongly conclude that the effect of Xi on Y is zero. The local Algorithm 3 
is more robust against such estimation errors, since it only requires correct 
estimation of the neighborhood of Xi and not of the entire path between Xi 
to Y. 
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